IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages

Haq, Saiful; Sharma, Ashutosh; Bhattacharyya, Pushpak


arXiv:2312.09508 (cs)

[Submitted on 15 Dec 2023]



Abstract: In this paper, we introduce Neural Information Retrieval resources for 11 widely spoken Indian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu) from two major Indian language families (Indo-Aryan and Dravidian). These resources include (a) INDIC-MARCO, a multilingual version of the MSMARCO dataset in 11 Indian languages created using machine translation, and (b) Indic-ColBERT, a collection of 11 distinct monolingual Neural Information Retrieval models, each trained on one of the 11 languages in the INDIC-MARCO dataset. To the best of our knowledge, IndicIRSuite is the first attempt at building large-scale Neural Information Retrieval resources for a large number of Indian languages, and we hope that it will help accelerate research in Neural IR for Indian languages. Experiments demonstrate that Indic-ColBERT achieves a 47.47% improvement in the MRR@10 score averaged over the INDIC-MARCO baselines for all 11 Indian languages except Oriya, a 12.26% improvement in the NDCG@10 score averaged over the MIRACL Bengali and Hindi baselines, and a 20% improvement in the MRR@100 score over the this http URL Bengali Language baseline. IndicIRSuite is available at this https URL
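The abstract reports gains in MRR@10 without defining the metric. As a minimal sketch, assuming the standard definition (the reciprocal rank of the first relevant passage within the top k results, averaged over queries) and assuming the reported percentages are relative gains over a baseline, the computation might look like the following. The function names and the toy data are hypothetical and do not come from IndicIRSuite or INDIC-MARCO.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, passages in ranked.items():
        for rank, pid in enumerate(passages[:k], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked) if ranked else 0.0


def relative_improvement(baseline: float, model: float) -> float:
    """Percentage gain of a model score over a baseline score.

    Assumption: the abstract's percentages are relative gains of this kind.
    """
    return 100.0 * (model - baseline) / baseline


# Toy data, invented for illustration only (not from INDIC-MARCO).
ranked = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9", "p4"]}
relevant = {"q1": {"p1"}, "q2": {"p2"}}

# q1 has its first relevant passage at rank 2, q2 at rank 1.
score = mrr_at_k(ranked, relevant, k=10)
```

With these toy rankings the per-query reciprocal ranks are 1/2 and 1, so the mean is 0.75. A baseline score of 0.2 against a model score of 0.3 would count as a 50% relative improvement under this reading.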

Subjects:	Information Retrieval (cs.IR); Computation and Language (cs.CL)
Cite as:	arXiv:2312.09508 [cs.IR]
	(or arXiv:2312.09508v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2312.09508

Submission history

From: Saiful Haq
[v1] Fri, 15 Dec 2023 03:19:53 UTC (224 KB)
